3.3 Bayes Estimation for Frequentists

1 Bayes Risk and Bayes Estimator

1.1 Frequentist Motivation

Consider a model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ for data $X$, with loss $L(\theta, d)$ and risk $R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]$.
The Bayes risk is the average-case risk, integrated with respect to some measure $\Lambda$ on $\Theta$, called the prior.
For now, assume $\Lambda(\Theta) = 1$ (a probability measure). Later we will allow it to be improper ($\Lambda(\Theta) = \infty$).

$$r_\Lambda(\delta) = \int_\Theta R(\theta, \delta)\, d\Lambda(\theta) = E_{\theta \sim \Lambda}[R(\theta, \delta)] = E_{\theta \sim \Lambda}\big[E[L(\theta, \delta(X)) \mid \theta]\big] = E[L(\theta, \delta(X))].$$

(Here we assume $\theta \sim \Lambda$ and $X \mid \theta \sim P_\theta$; the last expectation $E$ is taken with respect to the joint distribution of $(\theta, X)$.)

An estimator $\delta$ minimizing $r_\Lambda(\cdot)$ is called a Bayes estimator. It depends on $\mathcal{P}$, $\Lambda$, and $L$. (By the tower property again, $r_\Lambda(\delta) = E\big[E[L(\theta, \delta(X)) \mid X]\big]$.)
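The identity $r_\Lambda(\delta) = E[L(\theta, \delta(X))]$ under the joint law of $(\theta, X)$ can be checked by Monte Carlo. A minimal sketch, where the Beta$(2,2)$ prior, Binomial$(10, \theta)$ likelihood, and the plug-in estimator $\delta(X) = X/10$ are all illustrative choices (not from the notes):

```python
import random

random.seed(0)

def draw_binomial(n, p):
    # Binomial(n, p) sampler via Bernoulli sums (stdlib only).
    return sum(random.random() < p for _ in range(n))

def delta(x, n=10):
    # Illustrative frequentist estimator: the MLE x/n.
    return x / n

# Bayes risk as the joint expectation E[L(theta, delta(X))]:
# draw theta ~ Lambda, then X | theta ~ P_theta, then average the loss.
N = 100_000
total = 0.0
for _ in range(N):
    theta = random.betavariate(2, 2)      # theta ~ Lambda = Beta(2, 2)
    x = draw_binomial(10, theta)          # X | theta ~ Binomial(10, theta)
    total += (theta - delta(x)) ** 2      # squared-error loss
r_bayes = total / N
print(round(r_bayes, 4))
```

Since $R(\theta, \delta) = \mathrm{Var}(X/10) = \theta(1-\theta)/10$ here, the estimate should land near the exact Bayes risk $E[\theta(1-\theta)]/10 = 0.02$ for a Beta$(2,2)$ prior.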

1.2 Prior and Posterior

Now we can explicitly define the densities:

Definition

  • Prior: $\lambda(\theta)$.
  • Likelihood: $p_\theta(x)$.
  • Joint density: $\lambda(\theta)\, p_\theta(x)$.
  • Marginal density: $q(x) = \int_\Theta \lambda(\theta)\, p_\theta(x)\, d\theta$.
  • Posterior density: $\lambda(\theta \mid x) = \lambda(\theta)\, p_\theta(x) / q(x)$.

The Bayes estimator depends on the data only through the posterior:

$$\delta_\Lambda(x) = \operatorname*{argmin}_d E[L(\theta, d) \mid X = x] = \operatorname*{argmin}_d \int_\Theta L(\theta, d)\, \lambda(\theta \mid x)\, d\theta.$$

The Bayes estimator can thus be found "one $x$ at a time": for each observed $x$, minimize the posterior expected loss.
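A minimal sketch of the "one $x$ at a time" recipe on a discretized parameter space (the Beta$(2,2)$ prior, the Binomial$(10,\theta)$ likelihood, the observed $x = 7$, and the grids are all illustrative choices):

```python
import math

# Grid approximation to a Beta(2, 2) prior on theta in (0, 1).
grid = [i / 100 for i in range(1, 100)]
prior = [t * (1 - t) for t in grid]          # Beta(2, 2) density, up to a constant
Z = sum(prior)
prior = [p / Z for p in prior]

n, x = 10, 7                                 # observe X = 7 from Binomial(10, theta)

def likelihood(t):
    return math.comb(n, x) * t ** x * (1 - t) ** (n - x)

# Posterior lambda(theta | x) is proportional to prior * likelihood.
post = [p * likelihood(t) for p, t in zip(prior, grid)]
Zp = sum(post)
post = [w / Zp for w in post]

# "One x at a time": minimize the posterior expected loss over candidates d.
def post_exp_loss(d):
    return sum(w * (t - d) ** 2 for w, t in zip(post, grid))

d_star = min((i / 1000 for i in range(1001)), key=post_exp_loss)
post_mean = sum(w * t for w, t in zip(post, grid))
print(round(d_star, 3), round(post_mean, 3))
```

Under squared-error loss the grid minimizer lands on the posterior mean, previewing Section 2.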

Theorem

Suppose $X \mid \theta \sim P_\theta$ and $L(\theta, d) \ge 0$, with $r_\Lambda(\delta_0) < \infty$ for some estimator $\delta_0(X)$. Then $\delta_\Lambda$ is Bayes with $r_\Lambda(\delta_\Lambda) < \infty$ if and only if $\delta_\Lambda(x) \in \operatorname*{argmin}_d E[L(\theta, d) \mid X = x]$ for a.e. $x$.

2 Posterior Mean

2.1 Square Error Loss

If $L(\theta, d) = (g(\theta) - d)^2$, then the Bayes estimator is the posterior mean:
$$E[(g(\theta) - d)^2 \mid X] = E\big[(g(\theta) - E[g(\theta) \mid X] + E[g(\theta) \mid X] - d)^2 \mid X\big] = \operatorname{Var}(g(\theta) \mid X) + (E[g(\theta) \mid X] - d)^2,$$
so $\delta_\Lambda(X) = E[g(\theta) \mid X]$.

2.2 Weighted Square Error

If $L(\theta, d) = w(\theta)(g(\theta) - d)^2$ (e.g. $\left(\frac{\theta - d}{\theta}\right)^2$), then
$$E[(d - g(\theta))^2 w(\theta) \mid X] = d^2 E[w(\theta) \mid X] - 2d\, E[w(\theta) g(\theta) \mid X] + E[w(\theta) g(\theta)^2 \mid X],$$
which is minimized at
$$d = \frac{E[w(\theta) g(\theta) \mid X]}{E[w(\theta) \mid X]}.$$
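A quick numerical check of the weighted minimizer $d = E[w(\theta) g(\theta) \mid X] / E[w(\theta) \mid X]$ on a toy three-point posterior (the support, weights, and $w(\theta) = 1/\theta^2$ are illustrative choices, with $g(\theta) = \theta$):

```python
# Toy discrete posterior lambda(theta | x) on three support points.
support = [0.2, 0.5, 0.8]
post = [0.3, 0.5, 0.2]
w = lambda t: 1 / t ** 2            # weight from the loss ((theta - d)/theta)^2

def exp_loss(d):
    # Posterior expected weighted squared error at candidate d.
    return sum(p * w(t) * (t - d) ** 2 for p, t in zip(post, support))

# Closed form: ratio of posterior expectations E[w*theta | x] / E[w | x].
closed_form = (sum(p * w(t) * t for p, t in zip(post, support))
               / sum(p * w(t) for p, t in zip(post, support)))

# Brute-force grid minimization should land on the same value.
grid_min = min((i / 10000 for i in range(10001)), key=exp_loss)
print(round(closed_form, 4), round(grid_min, 4))
```

Note how the weight $1/\theta^2$ pulls the estimate toward small $\theta$ relative to the unweighted posterior mean.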

2.3 Other Examples

In several examples, the posterior mean is a weighted average of the sample mean and the prior mean. As $n$ grows, the weight on the sample dominates.
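A minimal sketch of this weighting in the Normal-Normal case: for $X_i \mid \theta \sim N(\theta, \sigma^2)$ and $\theta \sim N(\mu, \tau^2)$, the posterior mean is the precision-weighted average $\frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}\bar{X} + \frac{1/\tau^2}{n/\sigma^2 + 1/\tau^2}\mu$ (the particular numbers below are illustrative):

```python
def normal_posterior_mean(xbar, n, sigma2, mu, tau2):
    # Posterior mean for X_i | theta ~ N(theta, sigma2), theta ~ N(mu, tau2):
    # a precision-weighted average of the sample mean and the prior mean.
    w_data = (n / sigma2) / (n / sigma2 + 1 / tau2)
    return w_data * xbar + (1 - w_data) * mu

# As n grows, the weight on the sample mean xbar = 2.0 approaches 1.
for n in (1, 10, 1000):
    print(n, round(normal_posterior_mean(xbar=2.0, n=n, sigma2=1.0, mu=0.0, tau2=1.0), 4))
```

With $\mu = 0$ and $\bar{X} = 2$, the output climbs from $1.0$ at $n = 1$ toward $2.0$ as $n \to \infty$: the data overwhelm the prior.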

If the posterior belongs to the same family as the prior, we say the prior is conjugate to the likelihood. This is most common in exponential families.

3 Conjugate Priors

Suppose $X_i \mid \eta \overset{\text{i.i.d.}}{\sim} p_\eta(x) = e^{\eta^T T(x) - A(\eta)} h(x)$, $\eta \in \Xi \subseteq \mathbb{R}^s$, $i = 1, \dots, n$.
For any carrier $\lambda_0(\eta)$, define the $(s+1)$-dimensional family
$$\lambda_{\mu, k}(\eta) = e^{k\mu^T \eta - kA(\eta) - B(\mu, k)} \lambda_0(\eta),$$
whose sufficient statistic is $(\eta, -A(\eta))$ and whose natural parameter is $(k\mu, k)$.
Then
$$\lambda(\eta \mid x_1, \dots, x_n) \propto_\eta \Big( \prod_{i=1}^n e^{\eta^T T(x_i) - A(\eta)} h(x_i) \Big) e^{k\mu^T \eta - kA(\eta) - B(\mu, k)} \lambda_0(\eta) \propto_\eta e^{(k\mu + \sum_{i=1}^n T(x_i))^T \eta - (k+n) A(\eta)} \lambda_0(\eta),$$
so $\lambda(\eta \mid x_1, \dots, x_n) = \lambda_{\mu_{\mathrm{post}},\, n+k}(\eta)$, where
$$\mu_{\mathrm{post}} = \frac{k\mu + n\bar{T}}{k + n}, \qquad \bar{T}(X) = \frac{1}{n} \sum_{i=1}^n T(X_i).$$
Hence
$$\mu_{\mathrm{post}} = \frac{n}{k+n} \underbrace{\bar{T}}_{\text{UMVUE from data}} + \frac{k}{k+n} \underbrace{\mu}_{\text{``UMVUE'' from ``pseudo data''}}.$$
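A minimal numerical sketch of the update $\mu_{\mathrm{post}} = (k\mu + n\bar{T})/(k+n)$ in the Bernoulli case, where $T(x) = x$ and the standard identification $k = \alpha + \beta$, $k\mu = \alpha$ recovers the Beta posterior mean (the data and the Beta$(2,3)$ prior are illustrative choices):

```python
# Bernoulli observations; the sufficient statistic is T(x) = x.
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
s, n = sum(data), len(data)

alpha, beta = 2.0, 3.0                 # illustrative Beta(2, 3) prior on theta
k = alpha + beta                       # pseudo-sample size of the prior
mu = alpha / (alpha + beta)            # prior "pseudo-data mean"

# Conjugate update from the notes: weighted average of mu and T-bar.
mu_post = (k * mu + s) / (k + n)

# Familiar Beta-Bernoulli posterior Beta(alpha + s, beta + n - s) mean.
beta_post_mean = (alpha + s) / (alpha + beta + n)
print(round(mu_post, 4), round(beta_post_mean, 4))
```

The two expressions agree by the arithmetic identity $(k\mu + s)/(k + n) = (\alpha + s)/(\alpha + \beta + n)$ when $k\mu = \alpha$ and $k = \alpha + \beta$.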

3.1 Conjugate Prior Examples

  • Likelihood $X_i \mid \theta \sim \mathrm{Binomial}(n, \theta)$; conjugate prior $\theta \sim \mathrm{Beta}(\alpha, \beta)$.
  • Likelihood $X_i \mid \theta \sim N(\theta, \sigma^2)$; conjugate prior $\theta \sim N(\mu, \tau^2)$.
  • Likelihood $X_i \mid \theta \sim \mathrm{Poisson}(\theta)$; conjugate prior $\theta \sim \mathrm{Gamma}(\nu, s)$.
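A numerical illustration of the Poisson-Gamma pair, assuming the shape/rate parametrization $\mathrm{Gamma}(\nu, s)$ with mean $\nu/s$ (the notes do not spell the parametrization out, and the data and prior below are illustrative):

```python
# Poisson counts; under Gamma(shape=nu, rate=s) the posterior after
# observing x_1, ..., x_n is Gamma(nu + sum x_i, s + n).
data = [3, 1, 4, 2, 0, 2]
nu, s_rate = 3.0, 1.0                   # illustrative Gamma(3, 1) prior

nu_post = nu + sum(data)
s_post = s_rate + len(data)
post_mean = nu_post / s_post            # Gamma mean = shape / rate

# The posterior mean is again a weighted average of sample and prior means.
prior_mean = nu / s_rate
sample_mean = sum(data) / len(data)
w = len(data) / (s_rate + len(data))
print(round(post_mean, 4), round(w * sample_mean + (1 - w) * prior_mean, 4))
```

As in Section 2.3, the weight $n/(s+n)$ on the sample mean tends to $1$ as $n$ grows.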